A new feature selection algorithm based on binomial hypothesis testing for spam filtering
Abstract
Content-based spam filtering is a binary text categorization problem. Feature selection, an important and indispensable part of text categorization, therefore also plays an important role in improving spam filtering performance. We propose a new method, named Bi-Test, which uses binomial hypothesis testing to estimate whether the probability of a feature belonging to spam satisfies a given threshold. We evaluated Bi-Test on six benchmark spam corpora (pu1, pu2, pu3, pua, lingspam and CSDMC2010), using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVM), and compared it with four well-known feature selection algorithms (information gain, χ²-statistic, improved Gini index and Poisson distribution). The experiments show that, in terms of F1 measure, Bi-Test performs significantly better than χ²-statistic and Poisson distribution and is comparable to information gain and improved Gini index when the Naïve Bayes classifier is used; with the SVM classifier it is comparable to the other methods. Moreover, Bi-Test executes faster than the other four algorithms. © 2011 Elsevier B.V. All rights reserved.
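The abstract does not give the exact Bi-Test statistic, so the following is only a minimal sketch of the general idea it describes: a one-sided binomial test of whether the share of spam among the documents containing a term exceeds a threshold. The function names, the threshold of 0.5, the significance level of 0.05 and the toy counts are all assumptions made for illustration; they are not taken from the paper.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def spam_feature_pvalue(spam_docs_with_term, docs_with_term, threshold=0.5):
    """One-sided p-value for H0: P(spam | term) <= threshold (assumed test form)."""
    if docs_with_term == 0:
        return 1.0  # no evidence either way for a term that never occurs
    return binom_upper_tail(spam_docs_with_term, docs_with_term, threshold)


def select_features(counts, threshold=0.5, alpha=0.05):
    """Keep terms whose p-value is below alpha, most significant first.

    counts maps term -> (spam docs containing term, all docs containing term).
    """
    pvalues = {
        term: spam_feature_pvalue(k, n, threshold) for term, (k, n) in counts.items()
    }
    kept = [term for term, pv in pvalues.items() if pv < alpha]
    return sorted(kept, key=pvalues.get)


# Hypothetical document counts, invented for this example.
counts = {"viagra": (18, 20), "meeting": (2, 20), "free": (12, 20)}
selected = select_features(counts)
print(selected)
```

With these toy counts only "viagra" passes the test. A term such as "meeting", which appears almost only in legitimate mail, would need a second test on the lower tail to be selected as a ham indicator; this sketch leaves that out.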
Similar resources
A Novel Hybrid Approach for Email Spam Detection based on Scatter Search Algorithm and K-Nearest Neighbors
Cyberspace and the Internet now dominate users' lives, and alongside business opportunities and time savings they bring hardware and software threats such as information theft and system intrusion. Security is the top priority in preventing cyber-attacks, and users should first detect the type of attack, because virtual environments are not moni...
A Classification Method for E-mail Spam Using a Hybrid Approach for Feature Selection Optimization
Spam is unwanted email that harms communications around the world. It is a growing problem for personal email, so detecting it is essential. Machine learning is well suited to this problem because its adaptive nature lets it learn the patterns needed for classification. Nonetheless, in spam detection, there are a large num...
The study on the spam filtering technology based on Bayesian algorithm
This paper analyzes spam filtering technology, studies the Naive Bayes algorithm in detail, and proposes an improved Naive Bayesian mail filtering technique. The improvements are in text selection and feature extraction. General Bayesian text classification algorithms mostly use information gain and cross-entropy for feature selection. Through the principle o...
A New Method for Characterization of Biological Particles in Microscopic Videos: Hypothesis Testing Based on a Combination of Stochastic Modeling and Graph Theory
Introduction: The motility of biological objects is an important parameter in many biomedical processes. Automated analysis of microscopic videos is therefore becoming an important step in recent research. Materials and Methods: In the proposed method, a hypothesis testing function is defined to separate biological particles from artifact and noise in captured v...
Fast SFFS-Based Algorithm for Feature Selection in Biomedical Datasets
Biomedical datasets usually include a large number of features relative to the number of samples. However, some data dimensions may be less relevant or even irrelevant to the output class. Selection of an optimal subset of features is critical, not only to reduce the processing cost but also to improve the classification results. To this end, this paper presents a hybrid method of filter and wr...
Journal title:
- Knowl.-Based Syst.
Volume 24, Issue
Pages -
Publication date: 2011